
Machine Learning Final Project

Team members: Ke Fan, Khalil Jalal Majed

First Part: Data Cleaning

The three data files are the World Happiness Reports of 2015, 2016 and 2017. Each report is a survey of the state of happiness in most countries. The data files live under the directory ./world-happiness/.

The size of happiness data 2015 is 158*12, and its columns are ['Country', 'Region', 'Happiness Rank', 'Happiness Score','Standard Error', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)','Generosity', 'Dystopia Residual'] .

The size of happiness data 2016 is 157*13, and its columns are ['Country', 'Region', 'Happiness Rank', 'Happiness Score','Lower Confidence Interval', 'Upper Confidence Interval','Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)','Freedom', 'Trust (Government Corruption)', 'Generosity','Dystopia Residual'].

The size of happiness data 2017 is 155*12, and its columns are ['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high','Whisker.low', 'Economy..GDP.per.Capita.', 'Family','Health..Life.Expectancy.', 'Freedom', 'Generosity','Trust..Government.Corruption.', 'Dystopia.Residual'].

Here are the data cleaning steps:

  • First, we removed the columns that differ across the three files.
  • Second, we unified the column names.
  • Third, we reordered several columns so that all three files share the same column order.
  • Fourth, we concatenated the files into a single 470*10 dataset.
  • Finally, we checked for missing data; if any had been found, we would have handled it here.

The final columns are ['Country', 'Happiness_Rank', 'Happiness_Score', 'Economy', 'Family', 'Health', 'Freedom', 'Trust', 'Generosity', 'Dystopia_Residual'].
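The steps above can be sketched end-to-end on two toy frames that mimic the 2015 and 2017 schemas (the values and the shortened column sets here are illustrative, not the real survey data):

```python
import pandas as pd

# Toy stand-ins for the 2015 and 2017 files (illustrative values only)
d15 = pd.DataFrame({"Country": ["A", "B"], "Region": ["X", "Y"],
                    "Happiness Rank": [1, 2], "Happiness Score": [7.5, 7.4],
                    "Standard Error": [0.03, 0.04]})
d17 = pd.DataFrame({"Country": ["A", "C"], "Whisker.high": [7.6, 7.2],
                    "Happiness.Rank": [1, 3], "Happiness.Score": [7.5, 7.1]})

# 1) drop the columns the files do not share
d15 = d15.drop(["Region", "Standard Error"], axis=1)
d17 = d17.drop(["Whisker.high"], axis=1)

# 2) unify the column names
cols = ["Country", "Happiness_Rank", "Happiness_Score"]
d15.columns = cols
d17.columns = cols

# 3) + 4) enforce the same column order, then concatenate
#    (ignore_index avoids duplicate row labels across files)
combined = pd.concat([d15[cols], d17[cols]], ignore_index=True)

# 5) check for missing values
print(combined.isna().sum().sum())  # 0
```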

In [10]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# read three files
data_2015 = pd.read_csv("./world-happiness/2015.csv")
data_2016 = pd.read_csv("./world-happiness/2016.csv")
data_2017 = pd.read_csv("./world-happiness/2017.csv")

# print the size of files
print("happiness data 2015: ", data_2015.shape)
print("happiness data 2016: ", data_2016.shape)
print("happiness data 2017: ", data_2017.shape)

print("\n")

# print the columns of these data to check the difference 
print("columns of happiness data 2015: ", data_2015.columns)
print("columns of happiness data 2016: ", data_2016.columns)
print("columns of happiness data 2017: ", data_2017.columns)
happiness data 2015:  (158, 12)
happiness data 2016:  (157, 13)
happiness data 2017:  (155, 12)


columns of happiness data 2015:  Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')
columns of happiness data 2016:  Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')
columns of happiness data 2017:  Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')
In [11]:
# remove differences
data_2015 = data_2015.drop(['Region', 'Standard Error'], axis=1)
data_2016 = data_2016.drop(['Region', 'Lower Confidence Interval','Upper Confidence Interval'], axis=1)
data_2017 = data_2017.drop(['Whisker.high','Whisker.low'],  axis=1)

# rename the column names
data_2015.columns = ['Country', 'Happiness_Rank', 'Happiness_Score',
       'Economy', 'Family', 'Health', 'Freedom', 'Trust',
       'Generosity', 'Dystopia_Residual']

data_2016.columns = ['Country', 'Happiness_Rank', 'Happiness_Score',
       'Economy', 'Family', 'Health', 'Freedom', 'Trust',
       'Generosity', 'Dystopia_Residual']

data_2017.columns = ['Country', 'Happiness_Rank', 'Happiness_Score',
       'Economy', 'Family', 'Health', 'Freedom',
       'Generosity', 'Trust', 'Dystopia_Residual']

# swap two columns in data 2017

data_2017 = data_2017[['Country', 'Happiness_Rank', 'Happiness_Score',
       'Economy', 'Family', 'Health', 'Freedom', 'Trust',
       'Generosity', 'Dystopia_Residual']]
In [12]:
# data combination

happiness = pd.concat([data_2015, data_2016, data_2017], ignore_index=True)  # ignore_index avoids duplicate row labels

print("The size of the data: ", happiness.shape)
The size of the data:  (470, 10)
In [13]:
# check if there is missing data 
happiness.isna().sum()
Out[13]:
Country              0
Happiness_Rank       0
Happiness_Score      0
Economy              0
Family               0
Health               0
Freedom              0
Trust                0
Generosity           0
Dystopia_Residual    0
dtype: int64
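No values are missing, so the last cleaning step is a no-op here. Had any gaps been found, a typical treatment is mean imputation for numeric columns; a minimal sketch on a toy frame (not the survey data):

```python
import numpy as np
import pandas as pd

# toy frame with two deliberate gaps
toy = pd.DataFrame({"Economy": [1.2, np.nan, 0.9],
                    "Family": [1.1, 1.0, np.nan]})

# fill each numeric column's gaps with that column's mean
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled.isna().sum().sum())  # 0
```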
In [14]:
# show several lines as example

happiness.head()
Out[14]:
Country Happiness_Rank Happiness_Score Economy Family Health Freedom Trust Generosity Dystopia_Residual
0 Switzerland 1 7.587 1.39651 1.34951 0.94143 0.66557 0.41978 0.29678 2.51738
1 Iceland 2 7.561 1.30232 1.40223 0.94784 0.62877 0.14145 0.43630 2.70201
2 Denmark 3 7.527 1.32548 1.36058 0.87464 0.64938 0.48357 0.34139 2.49204
3 Norway 4 7.522 1.45900 1.33095 0.88521 0.66973 0.36503 0.34699 2.46531
4 Canada 5 7.427 1.32629 1.32261 0.90563 0.63297 0.32957 0.45811 2.45176

Second Part: Data Visualization

In this part, we present several graphics that exhibit detailed information about the data.

These graphics include:

...

In [15]:
# add some visualization

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
from plotly.offline import iplot


# EDA
happiness[['Happiness_Score', 'Economy', 'Family', 'Health','Freedom', 'Trust', 
    'Generosity','Dystopia_Residual']].hist(figsize=(18,12), bins=50, grid=True);
In [16]:
# heatmap
plt.subplots(figsize=(10,8))
sns.heatmap(happiness.select_dtypes('number').corr(), cmap='coolwarm', annot=True);  # restrict to numeric columns so corr() ignores 'Country'
In [17]:
# The Happiness Score map all over the world 
dmap = dict(type = 'choropleth',
           locations = happiness['Country'],
           locationmode = 'country names',
           z = happiness['Happiness_Score'], 
           text = happiness['Country'],
           colorbar = {'title':'Happiness score'})
layout = dict(title = 'Happiness Map', 
              geo = dict(showframe = False, 
                         projection = {'type': 'equirectangular'}))
choromap = go.Figure(data = [dmap], layout=layout)
iplot(choromap)
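One subtlety with the map above: after concatenation each country appears up to three times (once per year), so the choropleth receives duplicate locations and which year's score it ends up rendering is not well defined. Averaging the score per country first gives a cleaner map; a sketch using hypothetical values:

```python
import pandas as pd

# hypothetical repeated rows for one country across the three reports
toy = pd.DataFrame({"Country": ["Norway", "Norway", "Norway"],
                    "Happiness_Score": [7.522, 7.498, 7.537]})

# collapse to one row per country: the mean score across reports
mean_scores = toy.groupby("Country", as_index=False)["Happiness_Score"].mean()
print(mean_scores)
```

The resulting frame can feed the same choropleth dict via `locations=mean_scores['Country']` and `z=mean_scores['Happiness_Score']`.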

Third Part: Data Clustering

In this part, we used several clustering algorithms to cluster the data.

These include:

  • K-Means Clustering: K-means is a simple clustering method that assigns each point to the nearest cluster center. The initial center of each cluster is chosen at random. For this method, we must pre-define the number of clusters.
  • Gaussian Mixture Modelling: This is a model-based clustering method, which fits a probabilistic model to the clusters and optimizes its fit to the data. Gaussian Mixture Modelling assumes each cluster can be represented by a Gaussian distribution, so the data is modeled as a mixture of such distributions.
  • Agglomerative Clustering: This method is a form of hierarchical clustering. At the beginning, each point is treated as a separate cluster; clusters are then merged step by step based on distances. As a result, distances among points in the same cluster are small, while distances between clusters are large. Unlike k-means, the number of clusters can be chosen by cutting the resulting hierarchy, although scikit-learn's implementation still takes an n_clusters parameter (default 2).
  • Affinity Propagation: This algorithm depends on the similarity between pairs of data points and considers every data point as a potential exemplar. Pairs of points exchange real-valued messages until a high-quality set of exemplars and clusters emerges. This method does not need the number of clusters to be pre-defined.

After each clustering, we also present graphics showing the clustering results.

In [18]:
# remove irrelevant columns
data = happiness.drop(["Country", "Happiness_Rank"], axis=1)
In [19]:
# The number of clusters was chosen experimentally:
# we tried 2, 3, 4 and 5 clusters, and 3 performed best.

from sklearn.cluster import KMeans

# K-means
def K_means(X, nclust):
    model = KMeans(n_clusters=nclust)
    model.fit(X)
    pred = model.predict(X)
    cens = model.cluster_centers_
    return pred, cens

predict, centers = K_means(data, 3)
kmeans = pd.DataFrame(predict)
happiness.insert((happiness.shape[1]),'kmeans',kmeans)
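The claim that three clusters performs best can be backed quantitatively with silhouette scores (higher is better). A sketch on synthetic data with three well-separated groups; on the real notebook one would pass `data` instead of `X`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in: 300 points in 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 on this synthetic data
```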
In [20]:
# Plot the relationship between Economy and the happiness score, colored by the K-means clusters
# x is GDP per Capita, y is Happiness Score

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(happiness['Economy'],happiness['Happiness_Score'],
                     c=kmeans[0],s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('GDP per Capita')
ax.set_ylabel('Happiness Score')
plt.colorbar(scatter)
Out[20]:
<matplotlib.colorbar.Colorbar at 0x12ba95f90>
In [21]:
# The countries clustering result of K-means

dmap1 = [dict(type='choropleth',
             locations = happiness['Country'],
             locationmode = 'country names',
             z = happiness['kmeans'],
             text = happiness['Country'],
             colorbar = {'title':'Clusters'})]
layout1 = dict(title='Countries Clustering based on K-Means',
              geo=dict(showframe = False,
                       projection = {'type':'equirectangular'}))
map1 = go.Figure(data = dmap1, layout=layout1)
iplot(map1)
In [22]:
# Guassian Mixture Modelling

from sklearn.mixture import GaussianMixture

def GMM(X, nclust):
    model = GaussianMixture(n_components=nclust,init_params='kmeans')
    model.fit(X)
    pred = model.predict(X)
    return (pred)

pred1 = GMM(data, 3)
gmm = pd.DataFrame(pred1)
happiness.insert((happiness.shape[1]),'gmm',gmm)
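For GMM, the number of components can also be chosen with an information criterion such as BIC, which rewards fit but penalizes extra components; the lowest BIC wins. A sketch on the same kind of synthetic three-cluster data, not the happiness frame:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# synthetic stand-in: 300 points in 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# BIC for each candidate number of components (lower is better)
bics = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)
print(best_k)  # 3 on this synthetic data
```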
In [23]:
# Plot the relationship between Economy and the happiness score, colored by the GMM clusters
# x is GDP per Capita, y is Happiness Score

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(happiness['Economy'],happiness['Happiness_Score'],
                     c=gmm[0],s=50)
ax.set_title('GMM Clustering')
ax.set_xlabel('GDP per Capita')
ax.set_ylabel('Happiness Score')
plt.colorbar(scatter)
Out[23]:
<matplotlib.colorbar.Colorbar at 0x12df43d90>
In [24]:
# The countries clustering result of GMM

dmap2 = [dict(type='choropleth',
             locations = happiness['Country'],
             locationmode = 'country names',
             z = happiness['gmm'],
             text = happiness['Country'],
             colorbar = {'title':'Clusters'})]
layout2 = dict(title='Countries Clustering based on GMM',
              geo=dict(showframe = False,
                       projection = {'type':'equirectangular'}))
map2 = go.Figure(data = dmap2, layout=layout2)
iplot(map2)
In [25]:
# Agglomerative Clustering
# Note: scikit-learn's AgglomerativeClustering does not pick the number of
# clusters by itself; here we keep its default n_clusters=2

from sklearn.cluster import AgglomerativeClustering

def agglomerative(X):
    model = AgglomerativeClustering(affinity = 'euclidean', linkage = 'ward')
    pred = model.fit_predict(X)
    return (pred)

pred2 = agglomerative(data)
agg = pd.DataFrame(pred2)
happiness.insert((happiness.shape[1]),'agg', agg)
In [26]:
# Plot the relationship between Economy and the happiness score, colored by
# the Agglomerative Clustering result
# x is GDP per Capita, y is Happiness Score

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(happiness['Economy'],happiness['Happiness_Score'],
                     c=agg[0],s=50)
ax.set_title('Agglomerative Clustering')
ax.set_xlabel('GDP per Capita')
ax.set_ylabel('Happiness Score')
plt.colorbar(scatter)
Out[26]:
<matplotlib.colorbar.Colorbar at 0x12e707f90>
In [27]:
# The countries clustering result of Agglomerative Clustering

dmap3 = [dict(type='choropleth',
             locations = happiness['Country'],
             locationmode = 'country names',
             z = happiness['agg'],
             text = happiness['Country'],
             colorbar = {'title':'Clusters'})]
layout3 = dict(title='Countries Clustering based on Agglomerative Clustering',
              geo=dict(showframe = False,
                       projection = {'type':'equirectangular'}))
map3 = go.Figure(data = dmap3, layout=layout3)
iplot(map3)
In [28]:
# Affinity Propagation
# The number of clusters is determined automatically

from sklearn.cluster import AffinityPropagation

def affinity(X):
    model = AffinityPropagation(damping = 0.5, max_iter = 250, affinity = 'euclidean')
    model.fit(X)
    pred = model.predict(X)
    return (pred)

pred3 = affinity(data)
aff = pd.DataFrame(pred3)
happiness.insert((happiness.shape[1]),'aff',aff)
In [29]:
# Plot the relationship between Economy and the happiness score, colored by
# the Affinity Propagation result
# x is GDP per Capita, y is Happiness Score

fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(happiness['Economy'],happiness['Happiness_Score'],
                     c=aff[0],s=50)
ax.set_title('Affinity Propagation')
ax.set_xlabel('GDP per Capita')
ax.set_ylabel('Happiness Score')
plt.colorbar(scatter)
Out[29]:
<matplotlib.colorbar.Colorbar at 0x12951ecd0>
In [30]:
# The countries clustering result of Affinity Propagation

dmap4 = [dict(type='choropleth',
             locations = happiness['Country'],
             locationmode = 'country names',
             z = happiness['aff'],
             text = happiness['Country'],
             colorbar = {'title':'Clusters'})]
layout4 = dict(title='Countries Clustering based on Affinity Propagation',
              geo=dict(showframe = False,
                       projection = {'type':'equirectangular'}))
map4 = go.Figure(data = dmap4, layout=layout4)
iplot(map4)

Fourth Part: Data Regression

For this part, we used several regression methods to train on the data and predict the happiness score.

These include:

  • Linear Regression
  • Polynomial Regression
  • Random Forest Regressor
  • Support Vector Regressor (SVR)
  • Decision Tree Regressor

The evaluation metrics are mean squared error, mean absolute error, and the r2 score.

We also plot the predicted values against the actual values for each model.

Finally, for the methods with many hyperparameters, we used GridSearchCV to tune them automatically.
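The three metrics can be written directly from their definitions; a quick sketch checking hand-rolled formulas against scikit-learn's implementations on toy values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

err = y_true - y_pred
mse = np.mean(err ** 2)            # average squared error
mae = np.mean(np.abs(err))         # average absolute error
# r2 = 1 - SS_res / SS_tot
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# hand-rolled values match the library functions
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(mse, mae, r2)
```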

In [31]:
# Split data into X_train, X_test, y_train, y_test (8:2)
from sklearn.model_selection import train_test_split

y = data['Happiness_Score']
X = data.drop('Happiness_Score', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
In [32]:
# linear regression and measure scores

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

from sklearn.linear_model import LinearRegression

model1 = LinearRegression()
model1.fit(X_train, y_train)

print("The estimated intercept is", model1.intercept_)
print("The estimated coefficients are", model1.coef_)

pred1 = model1.predict(X_test)

mse_1 = mean_squared_error(y_test, pred1)
mae_1 = mean_absolute_error(y_test, pred1)
r2_1 = r2_score(y_test, pred1)

print("The value of mean squared error is", mse_1)
print("The value of mean absolute error is", mae_1)
print("The r2 score is", r2_1)
The estimated intercept is 4.050380174192014e-05
The estimated coefficients are [1.00007616 1.00002316 0.99988267 0.99993524 0.99979238 1.00017374
 0.9999786 ]
The value of mean squared error is 7.417979833170259e-08
The value of mean absolute error is 0.00022718079309204416
The r2 score is 0.9999999483430703
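The near-zero intercept, coefficients of roughly 1, and r2 ≈ 1 are expected rather than impressive: in these reports the happiness score is constructed as the sum of the six factor columns plus the dystopia residual, so linear regression simply recovers that identity. A toy sketch of the same effect, on synthetic columns rather than the survey data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((50, 3))   # three synthetic "factor" columns
y = X.sum(axis=1)         # target is exactly their sum

# OLS recovers the identity: all coefficients 1, intercept 0
model = LinearRegression().fit(X, y)
print(np.round(model.coef_, 6), round(model.intercept_, 6))
```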
In [33]:
import matplotlib.pyplot as plt

x = pred1
y = y_test

ax = plt.axes()
ax.scatter(x, y)

ax.plot(x, y)

ax.set_xlabel('Predict Happiness score')
ax.set_ylabel('Actual Happiness score')

ax.axis('tight')
Out[33]:
(2.441521064722567, 7.778567475294159, 2.436123481669853, 7.783876596531437)
In [34]:
# Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(7)  # degree-7 expansion of 7 features yields 3432 terms, far more than the 376 training rows
X_poly = poly.fit_transform(X_train)

model2 = LinearRegression()
model2.fit(X_poly, y_train)

pred2 = model2.predict(poly.transform(X_test))  # transform, not fit_transform: reuse the training-time feature mapping

mse_2 = mean_squared_error(y_test, pred2)
mae_2 = mean_absolute_error(y_test, pred2)
r2_2 = r2_score(y_test, pred2)

print("The value of mean squared error is", mse_2)
print("The value of mean absolute error is", mae_2)
print("The r2 score is", r2_2)
The value of mean squared error is 0.0002489135183933576
The value of mean absolute error is 0.006364576683185827
The r2 score is 0.9998266629403669
In [35]:
x_2 = pred2
y = y_test

ax = plt.axes()
ax.scatter(x_2, y)

ax.plot(x_2, y)

ax.set_xlabel('Predict Happiness score')
ax.set_ylabel('Actual Happiness score')

ax.axis('tight')
Out[35]:
(2.4757423732128228, 7.77579089088767, 2.436123481669853, 7.783876596531437)
In [36]:
# Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


regr = RandomForestRegressor(max_depth=2, random_state=0)


hyperparameters = {'n_estimators' : [20, 30, 40, 50],
                   'max_features' : ['auto', 'sqrt', 'log2'],
                   'max_depth': [None, 15, 10, 5], 
                   'random_state': [10, 15, 20, 25]
                  }

clf = GridSearchCV(regr, hyperparameters)


clf.fit(X_train, y_train)

pred = clf.predict(X_test)

print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()

mse_3 = mean_squared_error(y_test, pred)
mae_3 = mean_absolute_error(y_test, pred)
r2_3 = r2_score(y_test, pred)

print("The value of mean squared error is", mse_3)
print("The value of mean absolute error is", mae_3)
print("The r2 score is", r2_3)
Best parameters set found on development set:

{'max_depth': 10, 'max_features': 'auto', 'n_estimators': 40, 'random_state': 15}

The value of mean squared error is 0.07004145577230987
The value of mean absolute error is 0.193016373632718
The r2 score is 0.9512249070506399
In [37]:
x_1 = pred
y = y_test

ax = plt.axes()
ax.scatter(x_1, y)

ax.plot(x_1, y)

ax.set_xlabel('Predict Happiness score')
ax.set_ylabel('Actual Happiness score')

ax.axis('tight')
Out[37]:
(3.1042160928488847, 7.635681513563336, 2.436123481669853, 7.783876596531437)
In [39]:
from sklearn.svm import SVR
import warnings
warnings.filterwarnings("ignore")

svr = SVR(C=1.0)

parameters = {'kernel' : ['linear', 'poly', 'rbf', 'sigmoid'],
              'gamma' : ['scale', 'auto']
             }

clf2 = GridSearchCV(svr, parameters)
clf2.fit(X_train, y_train)

pred3 = clf2.predict(X_test)


print("Best parameters set found on development set:")
print()
print(clf2.best_params_)
print()

mse_4 = mean_squared_error(y_test, pred3)
mae_4 = mean_absolute_error(y_test, pred3)
r2_4 = r2_score(y_test, pred3)

print("The value of mean squared error is", mse_4)
print("The value of mean absolute error is", mae_4)
print("The r2 score is", r2_4)
Best parameters set found on development set:

{'gamma': 'scale', 'kernel': 'linear'}

The value of mean squared error is 0.0036136568575025276
The value of mean absolute error is 0.05291680423803349
The r2 score is 0.9974835410376856
In [40]:
x_2 = pred3
y = y_test

ax = plt.axes()
ax.scatter(x_2, y)

ax.plot(x_2, y)

ax.set_xlabel('Predict Happiness score')
ax.set_ylabel('Actual Happiness score')

ax.axis('tight')
Out[40]:
(2.5336191938152215, 7.66963193382463, 2.436123481669853, 7.783876596531437)
In [41]:
from sklearn.tree import DecisionTreeRegressor
In [42]:
regr_1 = DecisionTreeRegressor(max_depth=2)

parameters1 = {'criterion' : ['mse', 'friedman_mse', 'mae'],
              'max_depth' : [None, 20, 15, 10, 5],
              'max_features' : ['auto', 'sqrt', 'log2'],
              'random_state': [ 20, 25, 30, 35]
             }

clf3 = GridSearchCV(regr_1, parameters1)
clf3.fit(X_train, y_train)

pred4 = clf3.predict(X_test)


print("Best parameters set found on development set:")
print()
print(clf3.best_params_)
print()

mse_5 = mean_squared_error(y_test, pred4)
mae_5 = mean_absolute_error(y_test, pred4)
r2_5 = r2_score(y_test, pred4)

print("The value of mean squared error is", mse_5)
print("The value of mean absolute error is", mae_5)
print("The r2 score is", r2_5)
Best parameters set found on development set:

{'criterion': 'mae', 'max_depth': 10, 'max_features': 'auto', 'random_state': 25}

The value of mean squared error is 0.21090067215703767
The value of mean absolute error is 0.3279787128935468
The r2 score is 0.8531341221549998
In [43]:
x_3 = pred4
y = y_test

ax = plt.axes()
ax.scatter(x_3, y)

ax.plot(x_3, y)

ax.set_xlabel('Predict Happiness score')
ax.set_ylabel('Actual Happiness score')

ax.axis('tight')
Out[43]:
(2.5953183732764264, 7.759181544707582, 2.436123481669853, 7.783876596531437)
In [44]:
# The comparison results of the five regression methods

mean_squared_errors = np.array([mse_1, mse_2, mse_3, mse_4, mse_5])
mean_absolute_errors = np.array([mae_1, mae_2, mae_3, mae_4, mae_5])
r2_scores = np.array([r2_1, r2_2, r2_3, r2_4, r2_5])

fig, ax = plt.subplots()

index = np.arange(5)
bar_width = 0.25

ax.bar(index, mean_squared_errors,  width=bar_width, label='mean_squared_errors')
ax.bar(index + bar_width, mean_absolute_errors, width=bar_width, label='mean_absolute_errors')
ax.bar(index + 2 * bar_width, r2_scores, width=bar_width, label='r2_scores')

ax.set_xlabel('Regression methods')
ax.set_ylabel('Scores')
ax.set_title('Comparison result of Regression methods ')
ax.set_xticks(index + bar_width)  # center each group's label under its bars
ax.set_xticklabels(('LR', 'PR', 'RFR', 'SVR', 'DTR'))

ax.legend()
plt.show()
In [ ]: